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Performance optimized linear algebra operations are an essential ingredient for all numerical 
packages. Therefore several standard frameworks are focused on the performance optimized 
implementation of these operations, as for instance several packages implementing the BLAS 
standard. However, next to performance, also reliability, stability, usability, and maintainabil- 
ity are important aspects of such a framework, which in many cases only play a subordinate 
role in the development. 

In this report we introduce the smart expression template programming technique used in 
the math library of the ye rigid multibody physics engine. This technique offers an expression- 
dependent evaluation and optimization of mathematical vector and matrix operations, the 
automatic detection of aliasing effects, and combines the usability and maintainability of stan- 
dard C-| — |- programming with the efficiency of an optimized BLAS library. We will demonstrate 
the efficiency of this approach by comparing the ye math library with the Boost uBLAS library. 


1 Motivation 

A lot of CPU time in numerical simulations is spent in the execution of basic linear algebra subprograms 
(BLAS), such as vector additions, matrix- vector multiplications, and matrix-matrix multiplications. Ob- 
viously, the performance optimized implementation of these operations is crucial for efficient simulations. 
For this purpose the BLAS standard was introduced to define a set of the most important linear algebra 
operations. BLAS is focused on providing an interface for highly optimized linear algebra functionality. 
Since performance is platform dependent, the BLAS standard has been implemented in various libraries, 
as for instance ATLAS [2], Intel’s IMKL [10], AMD’s ACML [1], or NVidia’s CUBLAS [13]. All of these 
frameworks implement the set of BLAS operations defined by the BLAS standard according to its in- 
terface. However, all of them are primarily focused on performance, yet neglect several other important 
aspects of software frameworks, such as usability and maintainability. The following code example shows 
the signature of the C interface function for the multiplication of two dense, double precision matrices 
according to the expression C = a ■ A ■ B + (3 ■ C: 


Listing 1: Signature of the BLAS function for the matrix-matrix multiplication for double precision matrices 
void cblas_dgemm ( const enum CBLAS.ORDER Order, const enum CB LAS .TRANSPOSE TransA , 
const enum CBL AS .TRANSPOSE TransB , const int M, const int N, 
const int K, const double alpha, const double *A , 
const int Ida, const double *B , const int ldb , 
const double beta, double *C, const int ldc ); 

The trouble with the BLAS functions is the lack of usability and maintainability. The generality of the 
function formulation due to the language deficiencies of C and the goal to enable all kinds of matrix-matrix 
multiplications makes it harder to use this function and still harder for maintainers to understand the 
performed operation. Usability is also limited due to the general restriction to the two data types float 
and double 1 and the restriction to dense data structures. Integral data types and mixed types, as for 
instance for the multiplication between a float and a double matrix and sparse data structures are not 
supported at all. Additionally, it is up to the programmer to make sure the sizes of the matrices match, 
i.e. the function has no possibility to check for possible errors. 

Listing 2: Usage of the BLAS matrix-matrix multiplication 
// Definition of the matrices A, B, and C 
double* A = . . . ; 

double* B = . . . ; 

double* C = . . . ; 

// Initialization of the matrices 


II Performing the multiplication 

cblas.dgemm ( CblasRowMaj or , CblasNoTrans , CblasNoTrans , 3, 2, 3, 1.0, A, 3, B, 2, 0, C, 2 ); 

A different approach is taken by the Boost framework’s uBLAS library [4, 3]. This library is not pri- 
marily focused on high performance but on providing user-friendly and maintainable BLAS level 1, 2, and 
3 functionality for dense, packed, and sparse vectors and matrices. Their C+- 1- design and implementa- 
tion unifies mathematical notation via operator overloading and efficient code generation via expression 
templates. 


Listing 3: Matrix-matrix multiplication using the Boost uBLAS library 

II Definition of the matrices A, B, and C 


// Performing the multiplication 
C = A * B; 

In contrast to the multiplication via the cblas_dgemm function, the intent of the programmer is per- 
fectly clear in this case: the matrices A and B are multiplied and the result is written to matrix C. The 
complexity of the operation, including all performance optimizations, are hidden behind this intuitive 
interface. Another usability advantage in comparison to the BLAS standard is that even if the matrices 
are changed to single precision matrices (matrix<f loat>), the multiplication itself remains unchanged. 
This is even true for mixed type matrix-matrix multiplications: 

Listing 4: Mixed type matrix-matrix multiplication using the Boost uBLAS library 

II Definition of the matrices A, B, and C 
matrix <double > A, C; 
matrix <float > B; 

// Initialization of the matrices 


// Performing the mixed type multiplication 


1 ln case you also consider complex numbers as data type, single and double precision complex numbers are also supported. 


In addition, even if the sizes of the matrices are changed in the initialization part, the overloaded multi- 
plication operator makes sure that only matrices of matching size are multiplied. This gives an important 
security factor that cannot be provided by the BLAS functions. However, even for the default BLAS func- 
tionality for dense vectors and matrices, the performance of the Boost uBLAS implementation severely 
lacks behind the other BLAS implementations. 

In this report, we introduce the math library of the pe rigid multibody physics engine [11]. This library 
combines the usability and maintainability of a C+- t- code as implemented in the Boost library, and 
the performance of an optimized BLAS implementation. This is achieved by a smart expression template 
based implementation that allows expression dependent optimizations. In the following Section 2, the basic 
idea of expression templates is explained in details, followed by an in-depth analysis of smart expression 
templates in Section 3. Section 4 shows performance results for the pe math library in comparison to the 
Boost uBLAS library and the ATLAS library [2]. Section 5 concludes the report. 

2 Expression Templates 

Expression templates are a C+- 1- template programming technique to improve the performance degrading 
characteristics of the standard operator overloading technique. The expression template technique was 
concurrently invented by Todd Veldhuizen and David Vandevoorde in 1995 [14, 15, 16]. The basic idea 
of expression templates is the delay of the computation of overloaded operators by creating intermediate 
expression objects that represent the result of a numerical operation. Only if these intermediate objects 
are assigned to their destination, the evaluation of the expression is performed. This approach circumvents 
the creation of a temporary object, which usually consists of a memory allocation, a copy operation, and a 
memory deallocation. Since these expression objects can be efficiently inlined by the compiler, expression 
template based C++ code can attain 95-99.5% efficiency in comparison to hand-crafted C code and speed 
improvements of 2-15 times in comparison to conventional C-H- code [8]. 

To illustrate the technique of expression templates, the following example will first recall the classi- 
cal approach of operator overloading, before demonstrating the addition of dense vectors via expression 
objects. 


ypename Type > // Type of the vector elemen 

class Vector 
{ 

public: 

explicit Vector( std::size_t n, Type value=Type() ) 

: n. ( n ) 

, v_( new Type [n] ) 

std : : fill ( v_ , v_+n, value ); 


Vector( con 

: n_C v.n_ ) 

, v_( new Type [n_] ) 

std:: copy ( v . v_ , v.v_+n_, v_ ); 


:tor& rhs ) 
l * thi s ; 


return * this ; 


std : : size.t size () 


Typefe operator []( size_t i ) { return v. [i] ; > 

const Type & operator [] ( size_t i ) const { return v_ [i] ; > 

void resize( std::size_t n ) { /*...*/ > 

II ... 


Type* v_ ; // The dynamically allocated vector elements 


Listing 5 shows the relevant part of the basic implementation of a classical Vector class for elements of 
arbitrary type. A Vector can be either created by explicitly specifying its size n or via copy construction. 
Additionally, the class provides a copy assignment operator, functions for the direct access to the vector 
elements, and a function to change the size of a vector. 

Listing 6: Addition operator for the addition of two vectors 


template < typename Type > 

const Vector<Type> operator+( const Vector <Type >fe lhs , const Vector <Type >& rhs ) 

i 

if( lhs. size () ! = rhs. size Q ) 

throw std : : invalid_argument ( " Vector u sizes u do u not u match " ); 

Vector<Type> tmpC lhs.sizeC) ); 
for( std::size_t i=0; i<lhs . size () ; ++i ) 
tmp [i] = lhs [i] + rhs [i] ; 


Listing 6 shows the according addition operator for the addition of two Vectors 2 . The first step is a 
check whether the sizes of the two vectors match; if not, a std: : invalid_argument exception is thrown. 
Otherwise, a temporary vector tmp is created and filled with the sum of the two involved vectors, before 
it is returned. 

It is this temporary value that is made responsible for the performance penalty of the CH — b implemen- 
tation in comparison to a C or Fortran implementation [12]. The creation of the temporary value results 
in a dynamic memory allocation and deallocation, and in a vector copy operation during the initialization 
of the function return value by the return expression which is accomplished with copy initialization [5]. 
However, note that this is only half of the story since the temporary value doesn’t have to be created in 
case the return value is used to create another vector. In this case, the “named return value optimization” 
(NRV) as described in section 12.1.1c of the ARM [12], is used to optimize the code: 

Listing 7: Application of NRV during copy construction and copy assignment 

1 Vector <double > a, b; 

2 Vector < double > c( a + b ); //No temporary value necessary due to NRV 

4 c = a + b; // Creation of a temporary value 

If a compiler applies the NRV to a code (which is triggered by the presence of an explicit copy construc- 
tor) , the local variable tmp will be replaced by a reference to the eventual destination of the return value 
in the caller. It is as if the function were written as follows: 

Listing 8: Standard implementation of the vector addition operator 

1 template < typename Type > 

2 const Vector <Type > 

3 operator + ( Vector <Type >& dest , const Vector <Type >& lhs, const Vector <Type >& rhs ) 

4 { 

5 II ... 

7 dest . Vector <Type>: : Vector ( lhs.sizeC) ); // Explicit call of the constructor 


2 Note that this particular definition of the addition operator is only able to add two vectors of the same element type. By 
using the MathTrait class template (see Appendix B) it would be easily possible to add two vectors of different element 
types. 


:ihs . size () ; 
i [i] ; 


f or ( std : : size_t i = 0; i< 
tmp [i] = lhs [i] + rhs 


In Listing 7, during the creation of c, the compiler is able to directly copy the result of the addition into 
vector c. However, this is not possible in the case of the assignment, since the copy assignment operator 
is a member function that performs operations similar to a destruction of c followed by a reinitialization 
of it. Optimizing the assignment could therefore severely alter the semantics of the program (i.e. remove 
side effects of the copy assignment operator). Therefore the compiler is forced to create a temporary, 
which results in code similar to the following one: 

Listing 9: Compiler generated code for the copy assignment of vectors 

1 Vector <double > a, b, c; 

2 Vector <double > tmpC a + b ); // NRV optimized addition of a and b into the temporary tmp 

3 c = tmp; // Assignment of the temporary to the vector c 

The performance penalty from the generation of temporaries even increases if several additions are 
concatenated: 


Listing 10: Concatenation of multiple vector additions 
i Vector <double> a, b, c, d; 

3 d = a + b + c; 


In this case a temporary vector is created for every single expression. Therefore the overhead of the 
vector addition grows considerably: one memory allocation and deallocation, one copy operation, and 
one loop execution for every addition expression. This overhead is not avoidable for the classical C++ 
formulation of the vector addition, although the final result could be calculated in a single f or-loop (as it 
is usually done in a plain C implementation): 

Listing 11: C-like implementation of multiple vector additions 

i Vector <double> a, b, c, d; 

3 f or ( size.t i=0; i<size; ++i ) { 

4 d[i] = a [i] + b[i] + c [i] ; 

5 } 


The approach of expression templates is to remove the creation of the temporary object entirely and 
to delay the execution of the expression until it is assigned to its target. Therefore the addition operator 
no longer returns the result of the addition, but a temporary object that acts as a placeholder for the 
addition expression [8]: 

Listing 12: Expression template formulation of the vector addition 

i template < typename A, typename B > 

3 { 

4 public : 

5 explicit Sum ( const A& a, const B& b ) 

e : a_ ( a ) 

7 , b_( b ) 

8 O 

0 std::size_t size () const { return a. . size () ; } 

1 double operator [] ( std::size_t i ) const ■[ return a_ [i] + b_[i]; } 


3 private: 

4 const A & a_ ; // Reference to the left-hand side operand 

5 const B& b_ ; // Reference to the right-hand side operand 


8 template < typename A, typename B > 

9 Sum<A,B> operator+( const A& a, const B& b ) 

20 { 

21 if( a. sizeO ! = b. sized ) 

22 throw std : : invalid_argument ( " Vector u sizes u do u not u match " ); 
24 return Sum<A,B>( a, b ); 


Instead of calculating the result of the addition of two vectors, the addition operator now returns an 
object of type Sum<A,B>, where A and B are the types of the left- and right-hand side operands. The 
only requirements the addition operator poses on A and B are the existence of a subscript operator and 
a size function. The Sum class now temporarily represents the addition, until an assignment operator is 
encountered: 


Listing 13: Expression template assignment operator 

1 template < typename Type > // Type of the vector elements 

2 class Vector 

3 { 

4 public: 

5 II ... 

7 template < typename A > 

8 Vectorfe operator=( const A& expr ) 

9 { 

0 resize ( expr.sizeQ ); 

1 for( std::size_t i=0; i<expr . size () ; ++i ) 

2 v_[i] = expr [i] ; 


This assignment operator is the only other assignment operator of the Vector class next to the copy 
assignment operator (which is strictly necessary due to the management of the dynamically allocated 
memory). Every time an expression object 3 is assigned to a Vector, the assignment operator from List- 
ing 13 is used to handle the assignment. It first resizes the vector accordingly and afterwards traverses the 
elements of the given expression within a single f or-loop. Note that this traversal triggers the evaluation 
of the expression due to the accesses of the values via the subscript operator. Also note that this f or-loop 
is the only f or-loop necessary to evaluate the entire expression. 


Via this formulation based on the inline formulation of all functions and the evaluation within a single 
f or-loop hidden in the assignment operator it is possible to attain the performance of a C-like implemen- 
tation as illustrated in Listing 11. It is even possible to concatenate several additions without the creation 
of any temporary (and still a single f or-loop evaluation): 

Listing 14: Expression template optimized addition of multiple vectors 
1 Vector <double > a, b, c, d; 


3 c=a+b+c; // Evaluated in a single for-loop 

The expression instantiation graph that is created during the compilation of the right-hand side ex- 
pression a + b + c is illustrated in Fig. 1. For the addition of the first two vectors, a temporary object 
of type Sum< Vector<double> , Vector<double> > is instantiated, that represents the addition of the 
two vectors a and b. For the addition with the third vector c, a temporary expression object of type 
Sum< Sum< Vector<double> , Vector<double> > , Vector<double> > is created, the represents the ad- 
dition of all three vectors. 


Sum< Sum< Vector<double> , Vector<double> >, Vector<double> > 


Sum< Vector<double>, Vector<double> > 


/ ~ I ) \ 

Vector<double> J ^ Vector<double> j ^ Vector<double> j 


Figure 1: Expression template instantiation graph for the addition of three vectors. 


3 The thorough reader might notice that due to the signature of this assignment operator all non-vector objects assigned to 
a vector that do not fit the signature of the copy assignment operator will use this assignment operator. The problems 
associated with this fact will be handled in the remainder of this section. 


Note that the expression objects returned instead of the immediate result of the operation are temporary 
objects as well. However, their creation and the according copy operations don’t hurt the performance for 
two reasons. The first reason is the fact that these temporary expression objects don’t contain any member 
data that are expensive to copy. Instead they only contain references to the operands of the according 
numerical operation. The second reason is the aforementioned NRV. Since the expression objects are 
not assigned but instead created, the compiler has leeway to optimize the copy operation away entirely. 
Listing 15 demonstrates this for the addition of three vectors as shown in Listing 14. 

Listing 15: Compiler generated temporaries during the addition of three vectors 
Vector < double > a, b, c, d; 

II Compiler generated temporary for the expression object returned by the addition of 
II vector a and b. Due to NRV no copy operation is required! 
const Sum< Vector <double > , Vector <double > > __tmpl__( a + b ); 

II Compiler generated temporary for the expression object returned by the addition of 
II the first addition of a and b and the third vector c. Again, due to NRV no copy 
// operation is required! 

const Sum< Sum< Vector <double > , Vector <double > >, Vector <double > > __tmp2__( __tmpl__ + c ) 

// Assignment of the right-hand side expression to the left-hand side vector d 
d = tmp2 ; 

In terms of the vector addition, the shown expression template formulation is already complete and 
working. Figure 2 shows a performance comparison between the classical operation overloading technique 
and the demonstrated expression template formulation for the addition of three vectors as performed in 
Listing 14 4 . 
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Figure 2: Performance comparison between the classical operation overloading technique and expression tem- 
plates for the addition of three vectors. The performance was measured on an Intel Core i7 940 CPU 
at 2.93 Ghz (Bloomfiled core) with 8 MByte of shared level three cache using double precision vector 
elements. The executables were compiled with the GNU G+- 1- 4.4.1 compiler (branch revision 150839). 

However, due to the general formulation of the addition operator and the assignment operator of the 
Vector class, this addition operator now also qualifies for additions between completely unrelated types: 

Listing 16: Imperfection of the general expression template formulation 
Matrix <double > A; 
y = A + x; 

Although it should not be possible to add a matrix and a vector, the current formulation of the addition 
operator allows this operation. In order to prevent the arbitrary addition of unrelated types, an Expr base 
class is introduced [8]: 

Listing 17: Final expression template formulation of the vector addition 



‘The 


machine for this performance measurement 


Intel Core2 Quad CPU (Q6700) with 2.66GHz. 



template < typename Type > // Type of the vector 

class Vector : public Expr< Vector<Type> > 

{ 

public : 

II ... 


template < typename A > 

Vectorft operator=( const Expr<A>& expr ) 

{ 

resize ( a.sizeQ ); 

f or ( std : : size.t i=0; i<a.size(); ++i ) 
v_[i] = a [i] ; 


elements 


template < typename A, typename B > 
class Sum : public Expr< Sum<A,B> > 

{ 

public : 

explicit Sum ( const AS a, const B& b ) 

: a_ ( a ) 

, b.( b ) 

o 

double operator [] ( std::size_t i ) const { return a_ [i] + b_ [i] ; } 


private : 

const AS a_ ; // Reference to the left-hand side operand 
const B& b_ ; // Reference to the right-hand side operand 


template < typename A, typename B > 

Sum<A,B> operator+( const Expr<A>& a, const Expr<B>& b ) 

{ 

ifC (" a) . size () != ('b).sizeO ) 

throw std : : invalid_argument ( " Vector u sizes u do u not u match " ); 

return Sum<A,B>( ‘a, ’b ); 


Both the Vector and the Sum class now publicly derive from the Expr base class and are therefore 
considered valid operands for the addition operator. The Expr base class is based on the “Curiously 
Recurring Template Pattern” (CRTP) [16], which enables a type-safe downcast to the dynamic type of 
the expression object. In the implementation of the pe this downcast is explicitly performed via the 
operator”, as for example used in the addition operator. Note that in this improved expression template 
formulation, both the addition operator as well as the assignment operator are now only accepting Expr 
objects and therefore effectively restrict the number of possible arguments to the allowed expressions. 


3 Smart Expression Templates 

Smart expression templates are an extension of the expression template formulation shown in Section 2. 
The term “smart” in this sense refers to the fact that smart expression templates (in contrast to standard 
expression templates) can optimize the expression evaluation depending on the type of the operands (see 
3.1). Additionally, the smart expression templates as implemented in pe are able to detect aliasing effects 
and adapt the expression evaluation accordingly (Section 3.2). Furthermore, smart expression templates 
allow an integration of performance optimized BLAS functionality in order to combine the usability and 
maintainability of the expression templates with the high performance of BLAS (see Section 3.3). 


3.1 Expression Dependent Evaluation 

The expression template formulation as shown in the previous chapter works fine for simple expressions, as 
for instance the addition of vectors, the multiplication between a matrix and a vector, or a matrix-matrix 
multiplication. However, for complex or composite expressions the standard formulation can lead to severe 
performance penalties. One example is the following matrix-vector multiplication: 

Listing 18: Complex matrix-vector multiplication 

Matrix <double > A; 

Vector < double > a, b, c; 

c = A * ( a + b ); 

Using the expression template formulation as shown before, this right-hand side expression would result 
in an expression object that contains references to its two operands: the matrix A and the vector addition 
a + b. However, for the evaluation of this expression, the entire result of the vector addition is needed 
for every single element of the c vector. Therefore each element of the vector addition is evaluated M 
times, where M is the number of rows of the matrix. This unfortunately decreases the performance of the 
expression template formulation considerably. Figure 3 shows the performance results of the composite 
matrix-vector multiplication for the expression in Listing 18. Obviously, the standard formulation of 
expression templates is slower than the classical approach that involves two vector temporaries (one for 
the vector addition, one for the matrix- vector multiplication), although during the expression template 
evaluation no memory allocations/deallocations or expensive copy operations take place. Included in the 
figure are also the performance result for the smart expression template implementation as explained in 
this section and the result for the combination of smart expression templates with a performance optimized 
BLAS implementation. 


Classical Operator Overloading 
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Smart Expression Templates 
Smart Expression Templates (BLAS) 
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Figure 3: Performance comparison between the several implementations of a dense matrix- vector multiplication. 

The performance was measured on an Intel Core i7 949 CPU at 2.93 GHz (Bloomfield core) with 8 
MByte of shared level three cache and using double precision elements for both the matrix and the 
vectors. All executables were compiled with the GNU G++ 4.4.1 compiler (branch revision 150839). 


Apparently it is necessary to distinguish between a matrix-vector multiplication between a matrix and 
a vector and a matrix-vector multiplication between a matrix and a vector expression. Therefore the 
expression classes are updated by a mechanism to detect whether or not the operands are expressions or 
plain objects. Listing 19 shows part of the smart expression template implementation of the matrix-vector 
addition as implemented in the pe: 

Listing 19: Smart expression object for the matrix-vector multiplication 
template< typename MT // Type of the left-hand side dense matrix 
, typename VT > // Type of the right-hand side dense vector 
class DMatDVecMultExpr : public DenseVector< DMatDVecMultExpr <MT , VT> > 

, private Expression 

{ 

private : 

// Result type of the left-hand side dense matrix expression 
typedef typename MT : : ResultType MRT ; 

// Result type of the right-hand side dense vector expression 
typedef typename VT :: ResultType VRT ; 

// Element type of the left-hand side dense matrix expression 
typedef typename MRT : : ElementType MET; 








// Elf 


right-hand side 
ilementType VET; 


II ... 

public: 

// Result type for expression template evaluations, 
typedef typename MathTrait <MRT , VRT > : : Mult Type ResultType ; 

// Data type for composite expression templates 


// Resulting element type 

typedef typename ResultType :: ElementType ElementType ; 

// Member data type of the left-hand side dense matrix expression 
typedef typename MT : : CompositeType Lhs ; 

// Member data type of the right-hand side dense vector expression 

typedef typename SelectType < I sExpres s ion <VT >:: value , const VRT , const VT&>::Type Rhs ; 


itType operator []( 


// ... 

private : 

Lhs mat. ; // Left-hand side dense matrix of the multiplication expression 

Rhs vec_; // Right-hand side dense vector of the multiplication expression 

II ... 


vector (either plain matrices/vectors or matrix/vector expressions). This class is publicly derived from 
the DenseVector class template that works in the same way as the Expr base class of the standard ex- 
pression template formulation and allows a restriction of operands to dense vectors. Additionally, the 
DMatDVecMultExpr class derives privately from the Expression base class. The purpose of this base class 
is solely to mark the DMatDVecMultExpr class as a vector expression in comparison to a plain vector. 

The major difference in comparison to standard expression templates is an additional set of nested 
type definitions that are required for a smart evaluation of expressions. The MRT and VRT defined in a 
private section of the class are merely shortcuts for the result types of the left-hand side and right-hand 
side operands defined by the nested ResultType type definition. The ResultType represents the resulting 
data type in case the expression object is evaluated. Note that these types are never expression types but 
always concrete types (vectors, matrices, ...) that have to be supported by the MathTrait class template 
(see Appendix B). The result type of an expression is defined as the data type resulting from the according 
operation between the left-hand side operand of type RT1 and the right-hand side operand of type RT2. 
The nested type ElementType represents the element type of the expression, that is derived from the 
ResultType of the expression. MET and VET are shortcuts to the element type of the left-hand side and 
right-hand side result types. 

The nested CompositeType type represents the data type that is used in case the expression is used 
in another expression (as for instance another addition, subtraction, etc.). This data type is either 
a reference to the expression itself in case it is not necessary to evaluate the expression in case of a 
composite expression, or it is an according concrete type (in most cases the ResultType) in case it is 
necessary or beneficial to immediately evaluate the expression within a composite expression. In case 
of the DMatDVecMultExpr class, the CompositeType is a reference to DMatDVecMultExpr since it is not 
necessary to evaluate a matrix- vector multiplication prior to any composite evaluation and is not beneficial 
in terms of performance. However, in case of an addition between a dense and a sparse vector, the 
CompositeType is a concrete dense vector type, since it is a huge performance advantage to evaluate the 
expression immediately: 


Listing 20: Composite type of the smart expression object for the dense vector-sparse vector addition 

1 template < typename VT1 // Type of the left-hand side dense vector 

2 , typename VT2 > // Type of the right-hand side sparse vector 

3 class DVecSVecAddExpr : public DenseVector< DVecSVecAddExpr <VT1 , VT2 > > 

4 , private Expression 

5 { 

7 // . . . 

9 // Result type for expression template evaluations . 

o typedef typename MathTr ait <RT1 , RT2 > : : AddType ResultType ; 

2 // Data type for composite expression templates. 

3 typedef const ResultType CompositeType ; 

5 II ... 


In the DMatDVecMultExpr class, the CompositeType is for instance used to determine the member data 
type of the left-hand side operand (Lhs). In case an expression needs to be immediately evaluated, Lhs 
becomes a temporary that holds the result of the left-hand side expression. In case the expression does 
not have to be evaluated immediately, it corresponds to a reference to the left-hand side matrix operand. 

The left-hand side dense matrix operand does not require any special treatment. In the matrix- vector 
multiplication every element of the matrix is used exactly once. Therefore the decision whether or not to 
evaluate the matrix expression is handled by the nested CompositeType. The right-hand side dense vector 
expression on the other hand requires a special treatment. Depending on whether or not the right-hand 
side vector is a plain vector or a vector expression, the right-hand side member Rhs has to be chosen 
differently: in case the right-hand side vector operand is a plain vector, it is enough to store a reference 
to the vector inside the matrix-vector multiplication object. However, if the right-hand side vector is an 
expression type (i.e. derived from the Expression base class) a temporary is created to hold the result of 
the vector expression in order to guarantee an efficient evaluation of the matrix-vector multiplication. 

The type definition of the member data type for the right-hand side dense vector operand Rhs demon- 
strates how the type dependent evaluation of expressions works. The Rhs type is defined as either the result 
type of the vector expression (VRT) or a reference to the right-hand side plain vector via the SelectType 
class template (see Appendix A). The selection between these two types is performed by the IsExpression 
type trait: 


Listing 21: Type trait class for expressions 


1 template < typename T > 

2 struct IsExpressionHelper 

3 { 

4 enum { value = boost :: is_base_of <Expression , T: 

5 !boost::is_base_of<T, Expression : 

6 typedef typename SelectType < value , TrueType , Fals 


: : value && 

:: value >; 
iType>: : Type 


Type ; 


9 template < typename T > 

0 struct IsExpression : public IsExpressionHelper <T> :: Type 

1 { 

2 public: 

3 enum { value = IsExpressionHelper <T> :: value >; 

4 typedef typename IsExpressionHelper <T> :: Type Type; 


The IsExpression type trait class tests whether or not the given type Type is a pe expression tem- 
plate or not. In order to qualify as a valid expression template, the given type has to derive (publicly 
or privately) from the Expression base class. This relationship is tested via the Boost is_base_of type 
trait. In case the given type is a valid expression template, the value member enumeration is set to 1, 
the nested type definition Type is TrueType, and the class derives from TrueType. Otherwise value is set 
to 0, Type is FalseType, and the class derives from FalseType. 

For the evaluation of the right-hand side member type of the matrix-vector multiplication object, the 
SelectType<IsExpression<VT>: : value, const VRT, const VT&> results either in const VRT in case the 
right-hand side vector VT is an expression or in const VT& in case the right-hand side vector is a plain 


vector. 


An obvious difference in comparison to standard expression templates is the creation of temporary 
objects as members of the expression objects. At first sight this might be a potentially expensive endeavor 
since the expression objects are returned by value (i.e. by a copy operation) from the overloaded operators. 
However, also in this case, the NRV optimization removes any potential copy operation and directly 
constructs the object into a compiler generated, temporary value: 

Listing 22: Compiler generated temporaries for the matrix-vector multiplication 

1 Matrix <double > A; 

2 Vector <double> a, b, c; 

4 // Compiler generated temporary for the expression object returned by the addition of 

5 // vector a and b. No copy operation is required due to NRV! 

6 const DVecDVecAddExpr < Vector <double > , Vector <double > > __tmpl__( a + b ); 

8 // Compiler generated temporary for the expression object returned by the multiplication 

9 // of a dense matrix with a dense vector expression. Due to NRV, no copy operation is 

0 // required! 

1 const DMatDVecMultExpr < Matrix <double > , Vector <double > > __tmp2__( A * __tmpl__ ); 

3 // Assignment of the right-hand side expression to the left-hand side vector c 

4 c = — tm P 2 — ; 


3.2 Detection of Aliasing Effects 

The standard expression template formulation has a second weakness that is directly correlated to the 
correctness of the calculations. Consider the multiplication of a dense matrix and a dense vector as 
illustrated in Listing 23: 

Listing 23: Aliasing problem for a matrix-vector multiplication 

1 Matrix <double > A; 

2 Vector < double > x; 

4 x = A * x; // Aliasing effect!! 

In the standard formulation, this calculation would most certainly result in a wrong result. During 
the evaluation of the right-hand side multiplication expression, the values of the left-hand side vector are 
directly updated without the creation of an intermediate, temporary result. Therefore the values of x are 
changed during the expression evaluation although the old values of x are still required. In the standard 
expression template formulation this problem can only be solved by an explicit generation of a temporary, 
since it is not possible to detect such aliasing problems: 

Listing 24: Explicit resolution of aliasing problems 

1 Matrix <double > A; 

2 Vector < double > x; 

4 Vector <double > tmpC A * x ); // Explicit creation of a temporary vector, optimized by NRV! 

5 x = tmp ; // Update of vector x! 


However, it is potentially very dangerous to allow the formulation of incorrect calculations. 5 Therefore 
it is necessary to incorporate the detection of aliasing effects into the expression template evaluation. 
In the pe implementation, this is solved by an additional member function in every vector, matrix, and 
expression class, called isAliased: 

Listing 25: Member function to detect aliasing effects 

1 template < typename MT II Type of the left-hand side dense matrix 

2 , typename VT > // Type of the right-hand side dense vector 

3 class DMatDVecMultExpr : public DenseVector< DMatDVecMultExpr <MT , VT > > 

4 , private Expression 

5 { 

6 II ... 


5 Even a perfect documentation of this issue could not prevent any accidental use of this formulation even by experienced 
programmers. 


// . . 


template < typename T > 

inline bool isAliased( const T* alias ) 
return vec. . isAliased ( alias ); 


Lhs mat. ; // Left-hand side dense matrix of the multiplication expression 

Rhs vec.; // Right-hand side dense vector of the multiplication expression 


// 


The isAliased function of the DMatDVecMultExpr class returns whether any operand of the expression 
is abased with the given address alias. In case of the matrix-vector multiplication, aliasing effects can only 
occur in case the expression is assigned to a vector that is part of the multiplication expression. Therefore 
the function relays the query to the vector (expression). Listing 26 illustrates the implementation of the 
isAliased function in the Vector class: 

Listing 26: Implementation of the isAliased function in the Vector class 
template < typename Type > // Data type of the vector 

template< typename Other > // Data type of the foreign expression 
inline bool Vector <Type >:: isAliased ( const Other* alias ) const 

f 


The initial call to the isAliased function is contained in the assignment operator of the vector and 
matrix classes. For instance, consider the according assignment operator of the Vector class: 

Listing 27 : Smart expression template assignment operator of the Vector class 
template < typename VT > // Type of the right-hand side dense vector 

inline Vectorfe Vector : : operator=( const DenseVector <VT>& rhs ) 

{ 

using pe : : assign ; 

if ( I sExpression <VT>:: value && ( ' rhs ) . i s A1 iased ( this ) ) { 

Vector tmpC rhs ); 
swap ( tmp ) ; 

> 

else { 

resize ( (“rhs ). size (> » false ); 
assign ( *this , "rhs ); 


return *this ; 


This smart expression assignment operator represents the direct replacement of the standard expression 
template assignment operator demonstrated in Listing 17. It exclusively accepts dense vectors and dense 
vector expressions, i.e. classes deriving from the DenseVector class. The assignment operation distin- 
guishes between a default assignment and an aliased assignment. In case the given right-hand side dense 
vector is an expression, an alias test is performed with the address of the left-hand side vector. In case 
this test returns true, it is necessary to create a temporary and afterwards update the values of the vector 
via the “temporary swap” idiom [7, 6] . If no aliasing is detected, the vector is resized accordingly and the 
right-hand side is directly assigned without any temporary (the assign functionality is discussed in the 
next section). 

Note that self-assignment between two vectors (x = x) is handled by the copy assignment operator. 
Therefore the IsExpression<VT>: : value expression should be considered a performance optimization 
since it is not possible to encounter any aliasing effects in case the right-hand side dense vector is not 
an expression. Also note that this implementation of the detection of aliasing effects is based on the 
assumption that the addresses of the vectors and matrices are uniquely identifying a specific vector or 


matrix. In the current implementation this is particularly true since both base classes of the Vector class 
are empty base classes and therefore don’t occupy any memory on their own in case the compiler supports 
the “Empty Base Class Optimization” [17]. However, this property is assured by several compile time 
constraints [9]. 

3.3 Integration of BLAS Functionality 

In order to achieve a maximum level of performance, the smart expression template formulation has to 
allow the expression dependent customization of the assignment. This would for instance enable the 
integration of BLAS functionality to specific operations. For the purpose of this section, let’s consider the 
multiplication between two dense matrices since this particular operation profits the most from the use 
of a performance optimized BLAS library. Listing 28 shows the assignment operator of the Matrix class 
that is very similar to the assignment operator of the Vector class (Listing 27): 

Listing 28: Smart expression template assignment operator of the Matrix class 
template < typename Type > // Type of the matrix elements 
template < typename MT > // Type of the right-hand side dense matrix 

inline Matrix <Type >& Matrix <Type >:: operator = ( const DenseMatrix <MT>& rhs ) 

{ 

using pe : : assign; 

if ( IsExpression<MT>: : value && ( ~ rhs ) . i s A1 iased ( this ) ) { 

MatrixMxN tmpC rhs ); 
swap ( tmp ) ; 

> 

else { 

resize ( (~ rhs ). rows () , ('rhs ). columns () ); 

assign ( *this , "rhs ); 


♦ this ; 


Depending on the result of the if statement that handles the detection of abasing effects, either a 
temporary is created and afterwards “assigned” via the “temporary swap” idiom [7, 6], or the matrix is 
resized and afterwards assigned the right-hand side dense matrix (expression). The assign function is 
the key to the expression dependent optimization of the assignment, assign is a free function expecting 
as arguments the left-hand side target matrix and the right-hand side dense matrix (expression) to be 
assigned to the left-hand side matrix. The default implementation of assign is illustrated in Listing 29: 

Listing 29: Default implementation of the free assign function 
template < typename MT1 // Type of the left-hand side dense matrix 
, typename MT2 > // Type of the right-hand side dense matrix 
inline void assignC DenseMatrix <MT1 >fc lhs , const DenseMatrix <MT2 >& rhs ) 

{ 

pe_INTERNAL_ASSERT ( ( ‘ lhs ). rows ( ) == (* rhs ). rows () , " Invalid u number u of u rows " ); 

pe_INTERNAL_ASSERT ( ( ~ lhs ). columns ( ) == (~ rhs ). columns () , " Inval id u number u of u columns " ); 

Clhs) . assignC rhs ); 


The assign function expects the target matrix to have the appropriate size for the assignment. Therefore 
the resize operation is performed before the assignment in the assignment operator. Afterwards, the 
default assignment strategy for a dense matrix is selected, which is depending on the implementation of 
the left-hand side dense matrix and therefore a member function of the left-hand side matrix. The strategy 
for the assignment is therefore a ping-pong approach: the assignment operator calls a free function, that 
allows an expression dependent customization (for instance via overloading or specialization) and in the 
default case the default assignment is selected, which is again a member function of the Matrix class: 

Listing 30: Default implementation of the assignment of a dense matrix 
template < typename Type > // Type of the matrix elements 
template< typename MT > // Type of the right-hand side dense matrix 

inline void Matrix <Type >:: assign ( const DenseMatrix <MT >& rhs ) 


for( size.t j=0; j<end; j +=2 ) { 
v_ [i*n_+j ] = ("rhs ) (i , j ); 

v_ [i*n_ + j +1] = ( " rhs ) (i , j +1 ) ; 

> 

if ( end < (~ rhs ). columns ( ) ) { 

v_[i*n_+end] = ( " rhs ) ( i , end ) ; 


The default assignment applies an element-wise strategy to assign the new values. However, this strat- 
egy is a bad choice for the evaluation of a matrix-matrix multiplication. Therefore it will be necessary to 
customize this strategy explicitly. Note that even this function allows a certain level of optimization by 
applying explicit loop unrolling. 

In order to optimize the assignment of a matrix-matrix multiplication, the expression class has to provide 
an according assign function. The following two code excerpts explain in detail how this optimization is 
implemented in case of the DMatDMatMultExpr class that represents the multiplication between two dense 
matrices: 


Listing 31: Smart expression object for the matrix-matrix multiplication 
template < typename MT1 // Type of the left-hand side dense matrix 
, typename MT2 > // Type of the right-hand side dense matrix 
class DMatDMatMultExpr : public DenseMatrix< DMatDMatMultExpr <MT1 , MT2 > > 

{ > P P 

private : 

// Result type of the left-hand side dense matrix expression 
typedef typename MT1 : : ResultType RT1 ; 

// Result type of the right-hand side dense matrix expression 
typedef typename MT2 :: ResultType RT2 ; 

// Composite type of the left-hand side dense matrix expression 
typedef typename MT1 : : CompositeType CT1 ; 

// Composite type of the right-hand side dense matrix expression 
typedef typename MT2 :: CompositeType CT2 ; 

// Result type for expression template evaluations 
typedef typename MathTrait <RT1 , RT2 > : : MultType ResultType; 

// Data type for composite expression templates 

typedef const DMatDMatMultExpr & CompositeType; 

// Resulting element type 

typedef typename ResultType : : ElementType ElementType ; 

// Member data type of the left-hand side dense matrix expression. 

typedef typename SelectType < IsExpres s ion <MT1 >:: value , const RT1 , CT1 > : : Type Lhs ; 

// Member data type of the right-hand side dense matrix expression. 

typedef typename Sele ctType < I sExpres s ion <MT2 >:: value , const RT2 , CT2 > : : Type Rhs; 

II ... 

inline const ElementType operator O ( size_t i, size.t j ) const; 
inline size.t rows () const; 

inline bool isAliased( const T* alias ) const; 

II ... 

Lhs lhs_; // Left-hand side dense matrix of the multiplication expression. 

Rhs rhs_; // Right-hand side dense matrix of the multiplication expression. 

II ... 


Just as any other expression class of the pe the DMatDMatMultExpr class defines the nested types 
ResultType, CompositeType, and ElementType. Additionally, it defines the two member data types 
for the left-hand and right-hand side operands Lhs and Rhs depending on the given types. This process 
is similar to the type evaluation of the DMatDVecMultExpr class, since also for a matrix-matrix multipli- 
cation the creation of temporary values in case either of the two operands is an expression increases the 
performance. As a dense matrix, the DMatDMatMultExpr class also provides a function call operator for 
the access to the matrix elements, a rows and a columns function, and the isAliased function to detect 
aliasing effects. 


Listing 32 shows the customized optimization of the assign function. For the purpose of providing 
a customized assign, the DMatDMatMultExpr class defines four nested friend functions. At the point of 
instantiation of the class, these function are injected into the surrounding namespace and are considered 
by the compiler during the selection of matching candidates for assign function calls (this approach is 
also known as the Barton-Nackman trick [16]): 


Listing 32: Assign functions for the evaluation of the matrix-matrix multiplication 


> // Type of the target dens 

:ign( DenseMatrix <MT>& lhs, c< 


DMatDMatMultExpr & rhs ) 


pe.INTERNAL. ASSERT! !~lhs) . rows C 
pe_INTERNAL. ASSERT! !~lhs) . columj 

if! rhs . lhs. . columns !) == 0 ) { 


DMatDMatMultExpr 


:emplate< typenan 
. typenan 
. typenan 


I // Type < 
t // Type ( 
> > // Type ( 
sign ! MT3& C, 


left-hand side target matrix 
i left-hand side matrix operand 
i right-hand side matrix operand 
MT4& A, const MT5& B ) 


II Default implementation of the i 


;emplate <typen£ 
semplate <typens 
semplate <typens 
.ne void i 


sign! ] 


II Type 
// Type 
// Type 


of the left-hand s: 
of the left-hand s: 
of the right -hand : 


MT5 < float >& I 


size.t M! A. rows!) ); 
size.t N! B. columns!) ); 
size.t K! A. columns!) ); 

.sgemm! CblasRowMaj or , CblasNoTrans , I 
&A !0 , 0) , K, &B !0 , 0) , M, 0 . OF , 


template <typens 
template <typen£ 
template <typen£ 


sign! MT3<double 


// Type 
II Type 
// Type 


of the left-hand s: 
of the left-hand s: 
of the right-hand ! 


M! A.: 


s!) 


; size.t N! B. columns!) ); 

; size.t K! A. columns!) ); 
i_dgemm! CblasRowMaj or , CblasNoTrai 
& A !0 , 0) , K, &B !0 , 0) , N, 0 


The first of the four assign functions is a template for an expression dependent optimization of the 
assignment of a specific expression. The function is an overload to the default assign function. In 


contrast to the default function, the second argument is not a template argument, but an instance of a 
DMatDMatMultExpr. In case a matrix-matrix multiplication is assigned to another dense matrix, this func- 
tion is selected to handle the assignment to the right-hand side matrix instead of the default function. In 
case of the DMatDMatMultExpr, this function acts as a dispatcher. It merely calls one of the three provided 
assign member functions of the DMatDMatMultExpr class. These functions are overloaded such that in 
case the two operands of the multiplication match the arguments expected from one of the two BLAS 
matrix-matrix multiplication functions cblas_sgemm or cblas_dgemm, the according member function is 
called and the BLAS functions are executed. In case the BLAS functions cannot be used, the default 
performance optimized assign function is called. In contrast to the default assign, which employs an 
element-wise strategy for the assignment, this function implements a cache optimized calculation of the 
matrix values. 

Note that the same strategy can also be used for addition assignments and subtraction assignments: 

Listing 33: Addition and subtraciton assignment of a matrix-matrix multiplication 
i Matrix <double > A, B, C, D; 

3 C += A * B; 

4 D -= A * B; 

The according functions are called addAssign and subAssign, respectively, and follow the same imple- 
mentation approach as the assign function family. 


4 Performance Results 

In this section, the performance of the smart expression templates as implemented in the math library 
of the pe are examined in more detail. The focus of this section is the performance of composite expres- 
sions. Appendix C gives a complete overview of possible operations between dense and sparse vectors 
and matrices. The pe performance results are compared to the performance of the uBLAS library of the 
Boost framework. This library also offers an expression template based implementation of several dense 
and sparse vector and matrix data structures. However, so far Boost is primarily focused on usability 
and maintainability, and on providing a rich set of features for all kinds of mathematical operations. In 
contradiction to the name of the library, it does not yet support any high performance BLAS operations 6 . 

Listing 34 shows a code excerpt for the performance benchmark of the complex expression (A + B) ■ 
(C — D) using the uBLAS library. Prior to the performance measurement, all matrices are accordingly 
sized and initialized. Afterwards, a certain number of repetitions are executed, each time measuring the 
passed wall clock time with the timing functionality of the pe. Each operation is executed several times 
to guarantee runtimes of several seconds per measurement. Additionally, special care is taken to prevent 
the compiler to optimize the calculations away. 

Listing 34: Performance benchmark for the dense matrix-dense matrix multiplication using Boost uBLAS 

1 boost : : numeric : : ublas : : mat r ix < double > A( N, N ), B( N, N ), C( N, N ), D( N, N ), E( N, N ) 

2 pe :: timing :: WcTimer timer; 

5 // ... 

7 // Initial calculation of the matrix E 

s noalias ( E ) = prod( A + B, C - D ); 

0 for( std::size_t rep=0; rep<reps; ++rep ) { 

1 timer . start () ; 

2 for( std::size_t step=0; step<steps; ++step ) { 

3 noalias ( E ) = prod( A + B, C - D ); 

4 > 

5 timer. end (); 


The test machine for all performance tests was an Intel Core i7 940 CPU at 2.93 GHz (Bloomfield core) 
with 8 MByte of shared level three cache. All executables are compiled with the GNU G++ 4.4.1 compiler 


6 Although this might change with any 


release of the Boost library. 


(branch revision 150839). The data types of the vector and matrix elements is double for all performance 
tests. We used Boost 1.39 for all performance comparisons and the used BLAS library is ATLAS in the 
version 3.9.17. 

4.1 Complex Expression A ■ (a + b) 

Listing 35: Implementation of the expression A ■ (a + b) in the pe 
II Definition of a double precision dense square matrices 
// and three double precision dense vectors 
pe : : MatN A( N , N ) ; 
pe : : VecN a( N ) , b ( N ), c( N ); 

// Initialization 

II ... 

II Evaluation of the expression 
c = A * ( a + b ); 

The complex expression A ■ (a + b) as already used in the motivation part benefits from the smart 
expression template formulation due to the intermediate evaluation of the vector expresssion a + b into a 
temporary object. In comparison to the uBLAS library, the smart expression template formulation of the 
pe in combination with the BLAS cblas_dgemv function for double precision matrix- vector multiplications 
performs better by a factor of 1.41 for small matrices and vectors (N = 50) and by a factor of 1.47 for 
large matrices and vectors ( N = 1000). The major performance improvement in this case results from 
the creation of the temporary. The additional application of a performance optimized BLAS library only 
slightly improves the performance. 



4.2 Complex Expression A ■ (a + b + c) 


Listing 36: Implementation of the expression A ■ (a + b + c) in the pe 
II Definition of a double precision dense square matrices 
II and four double precision dense vectors 
pe : : MatN A( N , N ) ; 

pe : : VecN a( N ) , b ( N ), c( N ) , d ( N ); 

II Initialization 
U ... 

II Evaluation of the expression 
d = A * (a+b+c); 

For the multiplication of a matrix with a vector expression representing the addition of three vectors, 
the performance gain of the smart expression template formulation is even more prominent. In comparison 
to the uBLAS library, the pe performs better by a factor of 1.84 for small operands ( N = 50) and by a 
factor of 1.88 for large operands ( N = 1000). Again, the application of a performance optimized BLAS 
library only slightly improves the performance in comparison to the performance gain from the smart 
expression template formulation. 



Figure 5: Performance comparison between the pe and Boost for the complex expression A ■ (a + b + c). 


4.3 Complex Expression (A ■ B) ■ (a + b ) 


Listing 37: Implementation of the expression (A ■ B) ■ (a + b) in the pe 
II Definition of two double precision dense square matrices 
// and three double precision dense vectors 
pe : : MatN A(N,N),B(N,N); 
pe : : VecN a( N ) , b ( N ), c( N ); 

// Initialization 

II ... 

// Evaluation of the expression 
c=(A*B)*(a+b); 

The third complex expression represents a matrix-vector multiplication between a dense matrix and a 
dense vector, where the dense matrix is the result of a matrix-matrix multiplication and the dense vector 
results from the addition of two dense vectors. Although even without usage of the optimized BLAS 
functions the pe is faster than the BOOST uBLAS library by a factor of 1.15 for small operands (N = 50) 
and by a factor of 1.11 for larger operands (N = 400), the main performance improvement results from 
the ATLAS library. Note that although each element of the matrix resulting from the matrix-matrix 
multiplication is required only once during the subsequent matrix-vector multiplication, the creation of a 
temporary matrix for the result in order to be able to apply the BLAS functionality is very beneficial for 
the overall performance of the expression. Also note that the performance gain is higher the larger the 
operands of the expressions are. 



Figure 6: Performance comparison between the pe and Boost for the complex expression (A - B) • (a + b). 


4.4 Complex Expression (A ■ B) + C 


Listing 38: Implementation of the expression (A - B) + C in the pe 
II Definition of four double precision dense square matrices 
pe : : MatN A ( N , N ), B( N, N ), C(N, N ), D ( N , N); 

II Initialization 

II ... 

II Evaluation of the expression 
D = ( A * B ) + C; 

Also the fourth complex expression (A - B) + C greatly profits from the performance optimized BLAS 
functions. Whereas without the BLAS functions, the pe and uBLAS exhibit a similar performance, the 
application of ATLAS greatly improves the performance, especially for large operands. Also in this case, 
the performance gain is only possible by creating a temporary matrix for the result of the matrix-matrix 
multiplication in order to be able to apply the BLAS functionality. 




Figure 7 : Performance comparison between the pe and Boost for the complex expression (A ■ B) + C. 


4.5 Complex Expression (A + B) ■ (C — D) 


Listing 39: Implementation of the expression (A + B) ■ (C — D) in the pe 
// Definition of five double precision dense square matrices 
pe : : MatN A ( N , N), B( N, N ), C ( H , N), D ( N , N), E( N, N); 

II Initialization 
/-/ ... 


II Evaluation of the expression 
E = (A + B) * C C - D ) ; 

The complex expression (A + B) ■ (C — D) impressively demonstrates the advantage of smart expression 
templates for the evaluation of composite expressions. The matrix addition A + B is multiplied with 
the result of the matrix subtraction C — D. Since for the matrix-matrix multiplication all elements from 
both matrices are required several times, both operands are evaluated into temporary matrices. Only this 
approach already results in a performance difference of a factor of 1.48 for small matrices (N = 50) and 
a factor of 14.36 for large matrices (N = 500) in favor of the pe in comparison to Boost uBLAS. In case 
ATLAS is used, another factor of 3.52 for small matrices and 4.78 for large matrices is gained. In total, 
this results in a performance difference of a factor of 68.62 between the pe and uBLAS. 



Figure 8: Performance comparison between the pe and Boost for the complex expression (A + B) ■ (C — D). 


5 Conclusion 

In this report we introduced the smart expression template programming technique used in the math 
library of the pe rigid multibody physics engine. In comparison to standard expression templates as they 
are used to circumvent the performance degrading characteristics of the C++ technique of operator over- 
loading they offer an expression dependent evaluation and optimization including the possible integration 
of BLAS functionality, and the automatic detection of aliasing effects. We presented an in depth explana- 
tion of the implementation details of smart expression templates and gave an overview of the extensions 
necessary to update standard expression templates. In order to demonstrate the achieved performance of 
this approach, we compared the performance of several complex expresssions between the pe math library 
and the Boost uBLAS library. In comparison to uBLAS, which offers similar features as the pe math 
library and is also based on expression templates, smart expression templates in combination with BLAS 
may result in tremendous performance improvements depending on the given expression. 

A The SelectType class template 

The SelectType class template is a template metaprogramming class to select one of two given types 
depending on a boolean compile time constant expression. The implementation is based on a class template 
and a single specialization: 




Listing 40: Implementation of the SelectType class 
template < bool Select // Compile time selection 

, typename T1 // Type to be selected if Select=true 

, typename T2 > // Type to be selected if Select=false 

{ yp 

typedef T1 Type ; // The selected type . 


template < typename T1 // Type not to be selected 
, typename T2 > // Type to be selected 
struct SelectType <false ,T1 ,T2> 

{ 

typedef T2 Type ; // The selected type . 


In case the first given template argument Select is true, the base template is selected, which has a 
nested type definition for the second template argument (the first of the two given types to select from). 
In case Select is false, the specialization is selected, which defines a nested type for the third template 
argument (the second of the two given types). Note that this class merely selects one of two given types. 
It is therefore not necessary for the compiler to instantiate any of the given types. 


B The MathTrait class template 

The MathTrait class template offers the possibility to select the resulting data type of a generic math- 
ematical operation. Listing 41 demonstrates the use of MathTrait for the addition of two values. Via 
MathTrait it is possible to specify the return type of the add function without knowledge about the two 
given types T1 and T2: 


Listing 41: Application of the MathTrait class template 


template < typename T1 

, typename T2 > 

typename MathTrait<Tl,T2>:: AddType 
add ( T1 ti , T2 t2 ) 

{ 


// The type of the left operand 
// The type of the right operand 
// The resulting generic return type 
// 

// The function J add ’ returns the sum 
// of the two given values 


Per default, the MathTrait template provides specializations for all built-in data types (except void 
and bool, which are both not considered numeric data types, but instead including std: :size_t and 
std: :ptrdiff_t for several compilers) and all mathematical classes of the pe. Specifying the resulting 
data type for a specific operation is done by specializing the MathTrait template for this particular type 
combination. In case a certain type combination is not defined in a MathTrait specialization, the base 
template is selected, which is left undefined and therefore stops the compilation process: 

Listing 42: Undefined base template of the MathTrait class 
template < typename Tl , typename T2 > 
struct MathTrait; 


Each specialization of MathTrait defines the data types HighType that represents the high-order data 
type of the two given data types and LowType that represents the low-order data type. Additionally, each 
specialization defines the types AddType, SubType, MultType, and DivType, that represent the type of 
the resulting data type of the corresponding mathematical operation. The following example shows the 
specialization for operations between the double and the integer type: 


Listing 43: Specialization of the MathTrait class template 


struct MathTrait < 

typedef double 
typedef int 
typedef double 
typedef double 
typedef double 
typedef double 


HighType; 
LowType ; 
AddType ; 
SubType ; 
MultType ; 
DivType ; 


In case of operations between built-in data types, the MathTrait class defines the more significant data 
type as the resulting data type. For this selection, signed data types are given a higher significance. 
It is also possible to specialize the MathTrait template for additional user-defined data types, such as 
vectors and matrices. However, it is possible that a specific mathematical operation is invalid for the 
particular type combination. In this case, the INVALID.NUMERICAL.TYPE can be used to fill the missing 
type definition. The INVALID_NUMERICAL_TYPE represents the resulting data type of an invalid numerical 
operation. It is left undefined to stop the compilation process in case it is instantiated. The following 
example shows the specialization of the MathTrait template for MatrixMxN and VectorN. In this case, 
only the multiplication between the matrix and the vector is a valid numerical operation. Therefore for 
all other types the INVALID.NUMERICAL.TYPE is used. 


Listing 44: Specialization of the MathTrait class template 

1 template < typename T1 , typename T2 > 

2 struct MathTrait < MatrixMxN <T1 > , VectorN<T2> > 

3 { 

4 // Invalid, no common high data type 

5 typedef INVALID.NUMERICAL.TYPE HighType ; 

r // Invalid, no common low data type 

8 typedef INVALID.NUMERICAL.TYPE LowType ; 

0 // Invalid, cannot add a matrix and a vector 

1 typedef INVALID.NUMERICAL.TYPE AddType ; 

3 // Invalid, cannot subtract a vector from a matrix 

4 typedef INVALID.NUMERICAL.TYPE SubType ; 

6 // Multiplication between a matrix and a vector 

7 typedef VectorN< typename MathTrait <T1 , T2 >:: MultType > MultType ; 


// Invalid , cannot divide a matrix by a vector 

typedef INVALID.NUMERICAL.TYPE DivType ; 


Note the recursive instantiation of MathTrait for the definition of the data type resulting from a 
multiplication between a 3 x 3 matrix and a three-dimensional vector. Both classes are defined as templates 
to enable arbitrary data types as elements: 

Listing 45: Declarations of the Matrix and Vector classes 

1 template < typename T > class MatrixMxN; 

2 template < typename T > class VectorN; 

Therefore it is possible to combine matrices and vectors with different element types. In the simplest 
case, a matrix of double values might be multiplied with a vector of int values. In this case, the resulting 
data type would be a three-dimensional vector of double values: 

Listing 46: Mixed-type matrix-vector multiplication 

1 MatrixMxN <double> A; 

2 VectorN<int> v; 

4 VectorN <double > x = A * v; 


A more complex example might involve a matrix of matrices of float values and a vector of vectors of 
long values: 


Listing 47: Mixed-type matrix- vector multiplication 

1 MatrixMxN < MatrixMxN <float > > A; 

2 VectorN < VectorN <long > > v; 

4 VectorN < VectorN <float > > x = A * v; 


Still the operation is well defined and the correct return type of the multiplication can be evaluated by 
the recursive use of the MathTrait class template. 


C Further Performance Results 


This section contains further performance comparisons between the pe and Boost uBLAS. It shows a com- 
plete evaluation of all possible operations between dense and sparse vectors and matrices to demonstrate 
the suitability of the smart expression template approach. For the measurements, the following data 
types from the Boost uBLAS library were used: vector for dense vectors, matrix for dense matrices, 
compressed_vector for sparse vectors, and compressed_matrix for sparse matrices. The according data 
types of the pe are VectorN for dense vectors, MatrixMxN for dense matrices, SparseVectorN for sparse 
vectors, and SparseMatrixMxN for sparse matrices. The test machine for all performance tests was an 
Intel Core i7 940 CPU at 2.93 GHz (Bloomfield core) with 8 MByte of shared level three cache. All 
executables are compiled with the GNU G++ 4.4.1 compiler (branch revision 150839). 

A general observation of the performance results is that the pe always performs better than the Boost 
uBLAS library (except for the dense matrix-dense matrix multiplication with small matrices). For all 
operations between dense data structures, the performance difference between pe and uBLAS is either very 
similar or within reasonable bounds. However, all operations involving a sparse data structure exhibit 
a tremendous performance advantage for the pe. The reason for this is based in the general expression 
template formulation of uBLAS as well as in the optimization efforts of the pe. 

C.l Dense Vector- Dense Vector Addition/Subtraction 



Figure 9: Performance comparison between the pe and Boost for the dense vector dense vector addition/subtrac- 


C.2 Dense Vector-Sparse Vector Addition/Subtraction 



Figure 10: Performance comparison between the pe and Boost for the dense vector sparse vector addition/sub- 
traction. 


C.3 Sparse Vector-Dense Vector Addition/Subtraction 



Figure 11: Performance comparison between the pe and Boost for the sparse vector dense vector addition/sub- 
traction. 


C.4 Sparse Vector-Sparse Vector Addition/Subtraction 



Figure 12: Performance comparison between the pe and Boost for the sparse vector sparse vector addition/sub- 
traction. 


C.5 Dense Vector-Scalar Multiplication 



Figure 13: Performance comparison between the pe and Boost for the multiplication of a dense vector and a 
scalar. 


C.6 Sparse Vector-Scalar Multiplication 
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Figure 14: Performance comparison between the pe and Boost for the multiplication of a sparse vector and a 
scalar. 


C.7 Dense Matrix-Dense Vector Multiplication 





Figure 15: Performance comparison between the pe and Boost for the dense matrix dense vector multiplication. 






C.8 Dense Matrix-Sparse Vector Multiplication 



Figure 16: Performance comparison between the pe and Boost for the dense matrix sparse vector multiplication. 


C.9 Sparse Matrix-Dense Vector Multiplication 



Figure 17: Performance comparison between the pe and Boost for the sparse matrix dense vector multiplication. 


C.10 Sparse Matrix-Sparse Vector Multiplication 



Figure 18: Performance comparison between the pe and Boost for the sparse matrix sparse vector multiplication. 


C.ll Dense Matrix-Dense Matrix Addition/Subtraction 



Figure 19: Performance comparison between the pe and Boost for the dense matrix dense matrix addition/sub- 
traction. 


C.12 Dense Matrix-Sparse Matrix Addition/Subtraction 


Figure 20: Performance comparison between the pe and Boost for the dense matrix sparse matrix addition/sub- 
traction. 







C.13 Sparse Matrix-Dense Matrix Addition/Subtraction 



Figure 21: Performance comparison between the pe and Boost for the sparse matrix dense matrix addition/sub- 
traction. 


C.14 Sparse Matrix-Sparse Matrix Addition/Subtraction 



Figure 22: Performance comparison between the pe and Boost for the sparse matrix sparse matrix addition/sub- 
traction. 


C.15 Dense Matrix-Scalar Multiplication 



Figure 23: Performance comparison between the pe and Boost for the multiplication of a dense matrix and 
scalar. 


C.16 Sparse Matrix-Scalar Multiplication 



Figure 24: Performance comparison between the pe and Boost for the multiplication of a sparse matrix and 
scalar. 


C.17 Dense Matrix-Dense Matrix Multiplication 



Figure 25: Performance comparison between the pe and Boost for the dense matrix dense matrix multiplication. 





C.18 Dense Matrix-Sparse Matrix Multiplication 



Figure 26: Performance comparison between the pe and Boost for the dense matrix sparse matrix multiplication. 


C.19 Sparse Matrix-Dense Matrix Multiplication 



Figure 27: Performance comparison between the pe and Boost for the sparse matrix dense matrix multiplication. 


C.20 Sparse Matrix-Sparse Matrix Multiplication 
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Figure 28: Performance comparison between the pe and Boost for the sparse matrix sparse matrix multiplication. 
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